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Abstract. An approximate sparse recovery system consists of parameters k, N, an m-hj-N mea- 
^\ surement matrix, and a decoding algorithm, D. Given a vector, x, the system approximates x 

by x = D(4>x), which must satisfy ||x — x|| 2 < C ||x — Xfc|L, where x^ denotes the optimal fc-term 
approximation to x. For each vector x, the system must succeed with probability at least 3/4. 
Among the goals in designing such systems are minimizing the number m of measurements and the 
runtime of the decoding algorithm, D. 

In this paper, we give a system with m = 0(k log(iV/fc)) measurements — matching a lower 
bound, up to a constant factor — and decoding time 0(fclog c N), matching a lower bound up to 
log(iV) factors. We also consider the encode time (i.e., the time to multiply <& by x), the time to 
update measurements (i.e., the time to multiply $ by a 1-sparse x), and the robustness and stability 
i | of the algorithm (adding noise before and after the measurements). Our encode and update times 

are optimal up to log(iV) factors. The columns of <& have at most 0(log 2 (fc) log(N/k)) non-zeros, 
each of which can be found in constant time. If x is an exact fc-sparse signal and v\ and are 
arbitrary vectors (regarded as noise), then, setting x = D(3>(x + v±) + v-i), we get 
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x-x|| 2 < 2||i/i|| 2 +log(fc) 1 



t-H where 1 1 <t> 1 1 2 ^ 2 is a natural scaling factor that makes our result comparable with previous results. 

(The log(fc) factor above, improvable to log 1 / 2 " 1 " ' 1 ' k, makes our result (slightly) suboptimal when 
V2 7^ 0.) We also extend our recovery system to an FPRAS. 

(N 

1. Introduction 

Tracking heavy hitters in high- volume, high-speed data streams |4 , monitoring changes in data 
Q\ streams [5], designing pooling schemes for biological tests |TU] (e.g., high throughput sequencing, 

testing for genetic markers), localizing sources in sensor networks |15| HH] are all quite different 
J> technological challenges, yet they can all be expressed in the same mathematical formulation. We 

have a signal x of length N that is sparse or highly compressible; i.e., it consists of k significant 
entries ( "heavy hitters" ) which we denote by x^ while the rest of the entries are essentially negligible. 
We wish to acquire a small amount information (commensurate with the sparsity) about this signal 
in a linear, non-adaptive fashion and then use that information to quickly recover the significant 
entries. In a data stream setting, our signal is the distribution of items seen, while in biological group 
testing, the signal is proportional to the binding affinity of each drug compound (or the expression 
level of a gene in a particular organism) . We want to recover the identities and values only of the 
heavy hitters which we denote by x&, as the rest of the signal is not of interest. Mathematically, 
we have a signal x and an m-by-iV measurement matrix $ with which we acquire measurements 
y = 3>x, and, from these measurements y, we wish to recover x, with 0{k) entries, such that 

1 1 X X 1 1 o ^ C 1 1 X Xj^ 1 1 2 • 
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FIGURE 1 . Summary of the best previous results and the result obtained in this paper. 



Our goal, which we achieve up to constant or log factors in the various criteria, is to design 
the measurement matrix $ and the decoding algorithm in an optimal fashion: (i) we take as 
few measurements as possible m = 0{k\og{N /k)), (ii) the decoding algorithm runs in sublin- 
ear time 0{k\og{N /k)), and (iii) the encoding and update times are optimal 0(Nlog(N/k)) and 
0{k\og{N / k)) , respectively. In order to achieve this, our algorithm is a randomized algorithm; i.e., 
we specify a distribution on the measurement matrix $ and we guarantee that, for each signal, the 
algorithm recovers a good approximation with high probability over the choice of matrix. 

In the above applications, it is important both to take as few measurements as possible and 
to recover the heavy hitters extremely efficiently Measurements correspond to physical resources 
(e.g., memory in data stream monitoring devices, number of screens in biological applications) 
and reducing the number of necessary measurements is critical these problems. In addition, these 
applications require efficient recovery of the heavy hitters — we test many biological compounds 
at once, we want to quickly identify the positions of entities in a sensor network, and we cannot 
afford to spend computation time proportional to the size of the distribution in a data stream 
application. Furthermore, Do Ba, et al. [2] give a lower bound on the number of measurements for 
sparse recovery Q(klog(N/k)). There are polynomial time algorithms |13| [3] [12] meet this lower 
bound, both with high probability for each signal and the stronger setting, with high probability for 
all sig nalQ Previous sublinear time algorithms, whether in the "for each" model [HE] or in the "for 
all" model [llj, however, used several additional factors of log(TV) measurements. We summarize 
the previous sublinear algorithms in the "for each" signal model in Figure [TJ The column sparsity 
denotes how many Is there are per column of the measurement matrix and determines both the 
decoding and measurement update time and, for readability, we suppress O(-). The approximation 
error signifies the metric we use to evaluate the output; either the £2 or £\ metric. In this paper, 
we focus on the £2 metric. 

We give a joint distribution over measurement matrices and sublinear time recovery algorithms 
that meet this lower bound (up to constant factors) in terms of the number of measurements and 
are within log(/c) factors of optimal in the running time and the sparsity of the measurement matrix. 

Theorem 1. There is a joint distribution on matrices and algorithms, with suitable instantiations 
of anonymous constant factors, such that, given measurements <&x = y, the algorithm returns x 
and approximation error 

||x — x|| 2 < 2 ||^i|| 2 

with probability 3/4. The algorithm runs in time 0(klog c (N)) and has 0{k\og{N /k)) rows. 

Furthermore, our algorithm is a fully polynomial randomized approximation scheme. 

Theorem 2. There is a joint distribution on matrices and algorithms, with suitable instantiations 
of anonymous constant factors (that may depend on e, such that, given measurements <&x = y, the 
algorithm returns x and approximation error 

||x-x|| 2 < (1 + e) ||i/i|| 2 



albeit with different error guarantees and different column sparsity depending on the error metric. 
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with probability 3/4. The algorithm runs in time 0((k/e)log c (N)) and has 0((k/e)log(N/k)) 
rows. 

Finally, our result is robust to corruption of the measurements by an arbitrary noise vector Vi , 
which is an important feature for such applications as high throughput screening and other physical 
measurement systems. (It is less critical for digital measurement systems that monitor data streams 
in which measurement corruption is less likely.) When / 0, our error dependence is on V2 is 
suboptimal by the factor log(A;) (improvable to log 1//2+o( ^ 1 - ) A;). Equivalently, we can use log(A;) times 
more measurements to restore optimality. 

Theorem 3. There is a joint distribution on matrices and algorithms, with suitable instantiations 
of anonymous constant factors (that may depend on e), such that, given measurements $x + v<± = 
V + v i, the algorithm returns x and approximation error 

||x - x|| 2 < (1 + e) ||^i|| 2 + elog(fc) ..ffi 2 

H V ll2~+2 

with probability 3/4. The algorithm runs in time 0{{k/ e)\og c {N)) and has 0(k/elog(N/k)) 
rows. 

Previous sublinear algorithms begin with the observation that if a signal consists of a single 
heavy hitter, then the trivial encoding of the positions 1 through N with log(iV) bits, referred to 
as a bit tester, can identify the position of the heavy hitter. The second observation is that a 
number of hash functions drawn at random from a hash family are sufficient to isolate enough of 
the heavy hitters, which can then be identified by the bit tester. Depending on the type of error 
metric desired, the hashing matrix is pre- multiplied by random ±1 vectors (for the £2 metric) in 
order to estimate the signal values. In this case, the measurements are referred to as the Count 
Sketch in the data stream literature j3] and, without the premultiplication, the measurements are 
referred to as Count Median [HIE] and give £\ < C£\ error guarantees. In addition, the sublinear 
algorithms are typically greedy, iterative algorithms that recover portions of the heavy hitters with 
each iteration or that recover portions of the £2 (or £\) energy of the residual signal. 

We build upon the Count Sketch design but incorporate the following algorithmic innovations 
to ensure an optimal number of measurements: 

• With a random assignment of N signal positions to 0{k) measurements, we need to encode 
only 0{N/k) positions, rather than N as in the previous approaches. So, we can reduce 
the domain size which we encode. 

• We use a good error-correcting code (rather than the trivial identity code of the bit tester) . 

• Our algorithm is an iterative algorithm but maintains a compound invariant: the number of 
un-discovered heavy hitters decreases at each iteration while, simultaneously, the required 
error tolerance and failure probability become more stringent. Because there are fewer 
heavy hitters to find at each stage, we can use more measurements to meet more stringent 
guarantees. 

In Section [2] we detail the matrix algebra we use to describe the measurement matrix distribution 
which we cover in Section [3j along with the decoding algorithm. In Section [4] we analyze the 
foregoing recovery system. 

2. Preliminaries 

2.1. Vectors. Let x denote a vector of length N. For each k < N, let x& denote either the usual 
fc'th component of x or the signal of length iV consisting of the j largest- magnitude terms in x; it 
will be clear from context. The signal x& is the best fc-term representation of x. The energy of a 

1 ■ II II 2 I |2 

signal x is ||x|| 2 = 2^ i=1 |xj| . 
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FIGURE 2. Matrix algebra used in constructing an overall measurement matrix. 
The last column contains both the output dimensions of the matrix operation and 
its construction formula. 



2.2. Matrices. In order to construct the overall measurement matrix, we form a number of differ- 
ent types of combinations of constituent matrices and to facilitate our description, we summarize 
our matrix operations in Table [2] The matrices that result from all of our matrix operations have 
AT columns and, with the exception of the semi-direct product of two matrices ix r , all operations are 
performed on matrices A and B with A" columns. A full description can be found in the Appendix. 



3. Sparse recovery system 



In this section, we specify the measurement matrix and detail the decoding algorithm. 

3.1. Measurement matrix. The overall measurement matrix, is a multi-layered matrix with 
entries in {— 1,0, +1}. At the highest level, <1> consists of a random permutation matrix P left- 
multiplying the row direct sum of (lg(&)) summands, each of which is used in a separate 
iteration of the decoding algorithm. Each summand is the row direct sum of two separate 
matrices, an identification matrix, and an estimation matrix, E^\ 

$(i) 



3> 



where = E^® X D^\ 



In iteration j, the identification matrix D^) consists of the row direct sum of 0(j) matrices, all 
chosen independently from the same distribution. We construct the distribution (CW K t B^')QS^ 
as follows: 

• For j = 1,2, .. . ,lg(k), the matrix B^) is a Bernoulli matrix with dimensions kc^-by-N, 
where c is an appropriate constant 1/2 < c < 1. Each entry is 1 with probability G (l/ (kc 3 )) 
and zero otherwise. Each row is a pairwise independent family and the set of row seeds is 
fully independent. 

• The matrix C^' is an encoding of positions by an error-correcting code with constant 
rate and relative distance. That is, fix an error-correcting code and encoding/decoding 
algorithm that encodes messages of (log log N) bits into longer codewords, also of length 
0(loglogAQ, and can correct a constant fraction of errors. The i'th column of C^> is 
the direct sum of 0(loglog N) copies of 1 with the direct sum of E(ii),E(i2), • • • , where 
ii,»2j • • • are blocks of O (log log N) bits each whose concatenation is the binary expansion 
of i and E{-) is the encoding function for the error-correcting code. The number of columns 
in C^' matches the maximum number of non-zeros in B^\ which is approximately the 
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expected number, O (c'iV/fc), where c < 1. The number of rows in is the logarithm of 
the number of columns, since the process of breaking the binary expansion of index i into 
blocks has rate 1 and encoding by E(-) has constant rate. 
• The matrix S"' is a pseudorandom sign-flip matrix. Each row is a pairwise independent 
family of uniform ±l-valued random variables. The sequence of seeds for the rows is a fully 
independent family. The size of matches the size of cWktB®. 

Note that error correcting encoding often is accomplished by a matrix- vector product, but we are 
not encoding a linear error-correcting code by the usual generator matrix process. Rather, our 
matrix explicitly lists all the codewords. The code may be non-linear. 
The identification matrix at iteration j is of the form 



D® = 





(CWx. t B®)oSW 


i 








o(j). 



In iteration j, the estimation matrix E^' consists of the direct sum of O(j) matrices, all chosen 
independently from the same distribution, B'^QS'^\ so that the estimation matrix at iteration 
j is of the form 







i 








o(j). 



The construction of the distribution is similar to that of the identification matrix, but omits the 
error-correcting code and uses different constant factors, etc., for the number of rows compared 
with the analogues in the identification matrix. 

• The matrix B'^ is Bernoulli with dimensions 0(kd)-by-N, for appropriate c, 1/2 < c < 1. 
Each entry is 1 with probability (l/(fcc J )) and zero otherwise. Each row is a pairwise 
independent family and the set of seeds is fully independent. 

• The matrix S'^ is a pseudorandom sign-flip matrix of the same dimension as B'v\ Each 
row of S'U' is a pairwise independent family of uniform ±l-valued random variables. The 
sequence of seeds for the rows is a fully independent family. 

3.2. Measurements. The overall form of the measurements mirrors the structure of the measure- 
ment matrices. We do not, however, use all of the measurements in the same fashion. In iteration 
j of the algorithm, we use the measurements y^' = 3>( J )x. As the matrix «l>w) = E^'® T D^\ 

we have a portion of the measurements ic"' = .D^x that we use for identification and a portion 
z (j) = E (j) x that 

we use for estimation. The w™> portion is further decomposed into measurements 
[v^\u^\ corresponding to the run of (9(loglog N) l's in C^' and measurements corresponding to 
each of the blocks in the error-correcting code. There are 0{j) i.i.d. repetitions of everything at 
iteration j. 

3.3. Decoding. The decoding algorithm is shown in Figure [3] in the Appendix. 

4. Analysis 

In this section we analyze the decoding algorithm for correctness and efficiency 
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4.1. Correctness. Let x = x^. + v\ where we assume x is normalized so that ||^i|| 2 = 1 and xj. is 
the vector x with all but the largest-magnitude k entries zeroed out. Our goal is to guarantee an 
approximation x with approximation error ||x — x|| 2 < (1 + e) H^il^ + e ||^2 1 1 2 - But observe that 
is a different type of object from x or x; v<i is added to 3>x. For the main theorem to make sense, 
therefore, we need to normalize We discuss this now. 

Observe that the matrix $ can be scaled up by an arbitrary constant factor c > 1 which can be 
undone by the decoding algorithm: Let D' be a new decoding algorithm that calls the old decoding 
algorithm D as follows: D'(y) = T> f-y), so that D'(c$x + z^) = T> (<&x + -v?). Thus we can 
reduce the effect of v-i by an arbitrary factor c and so citing performance in terms of ||f2 1| alone is not 
sensible. Note also that and x are different types of objects; as an operator, takes an object 
of the type of x and produces an object of the type of V2- We will stipulate that the appropriate 
norm of $ be bounded by 1, in order to make our results quantitatively comparable with others. 
Our error guarantee is in £2 norm, so we should use a 2-operator norm; i.e.,, max||$a;|| 2 over x 
with || a; || 2 = 1. But our algorithm's guarantee is in the "for each" signal model, so we need to 
modify the norm slightly. 



the 



Definition 4. The ||<I>|| 2 .^ 2 norm of a randomly- constructed matrix $ is max^E 
smallest M such that, for all x with \\x\\ 2 = 1, we have \\Qx\\ 2 < M except with probability 1/4- 

Now we bound ||$|| 2 ^ 2 . Each row p of a Bernoulli(p) matrix with sign flips, BQS, satisfies 

E[|p£c| ] = p||j;|| 2 . So 1/p such rows satisfy ||(S0S)£c|| 2 < O ( ||a:|| 2 ]. Our matrix $ repeats the 

above j times in the j'th iteration, j < log 2 (/c), and combines it with an error-correcting code 
matrix of 0(log(A^/A;)) dense rows. It follows that 

11*11^ = 0(log 2 (A:)log(iV/A ; )). 

We are ready to state the main theorem. 



Theorem [3] Consider the matrices in Section \3.1\ and the algorithms in Section 3.3 (that share 



randomness with the matrices). The joint distribution on those matrices and algorithms, with 
suitable instantiations of anonymous constant factors (that may depend on e), are such that, given 
measurements 3>x + v% = y + ^2 , the algorithm returns x with approximation error 

||x - x|| 2 < (1 + e) IHI2 + elo g( fc ) ||l? 

Il v ll2~»2 

with probability 3/4. The algorithm runs in time k\og c N and $ has 0{k\og{N /k)) rows. 

In this extended abstract, we give the proof only for e = 1. Our results generalize in a straightfor- 
ward way for general e > (roughly, by replacing k with k/e at the appropriate places in the proof) 
and the number of measurements is essentially optimal in e. Because our approach builds upon 
the Count Sketch approach in we omit the proof of intermediary steps that have appeared 
earlier in the literature. 

We maintain the following invariant. At the beginning of iteration j, the residual signal has the 
form 

'3^ 



(Loop Invariant) = + vf^ with 



x (i) 



k 

< — , and 
- V 



,U) 



< 2 , 
2 ~ V4 



except with probability |(1 — i^)-'), where ||-|| is the number of non-zero entries. The vector x^) 
consists of residual elements of x^. Clearly, maintaining the invariant is sufficient to prove the 
overall result. In order to show that the algorithm maintains the loop invariant, we demonstrate 
the following claim. 

Claim 1. Let b^' be the vector we recover at iteration j. 
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The vector contains all but at most residual elements o/xj^ 
The vector b^) contains at most residual elements of Xj. with ' 
The total sum square error over all "good" estimates is at most 



with "good" estimates, 
bad" estimates. 



Proof. To simplify notation, let T be the set of un-recovered elements of x& at iteration j; i.e., the 
support of x"'. We know that \T\ < kj2? . The proof proceeds in three steps. 

Step 1. Isolate heavy hitters with little noise. Consider the action of a Bernoulli sign-flip 
matrix BQS with 0{k/2 :i ) rows. From previous work [U H], it follows that, if constant factors 
parametrizing the matrices are chosen properly, 

Lemma 5. For each row p of B, the following holds with probability 0(1): 

• There is exactly one element t of T "hashed" by B; i.e., there is exactly one t £ T with 
Pt = l- 

• There are 0(N ■ 2 3 /k) total positions (out of N) hashed by B. 



The dot product (pQS)r^ is S t r\ J > ± O 



So) 



,(o) 



Proof. (Sketch.) For intuition, note that the estimator St(pQS)r^> is a random variable with mean 

Then the third claim and the first two claims assert that the expected 



So) 



r^' and variance 



,0) 



behavior happens with probability 0(1). □ 

In our matrix the number of rows is not k/2 J but kc 3 for some c, 1/2 < c < 1. Take 

c = 2/3. We obtain a stronger conclusion to the lemma. The dot product (pQS)r^) is 



S t ri j) ± O 



1 



k(2/3)o 



O) 



StrP ± I 



(3/4) 



i 2 l 
k 



provided constants are chosen properly. Our lone hashed heavy hitter t will dominate the dot 
product provided 



Si) 



We show in the remaining steps that we can likely recover such heavy hitters; i.e., Identify 
identifies them and Estimate returns a good estimate of their values. There are at most (k/2 J ) 



heavy hitters of magnitude less than | ( (3/4) J ^- 



,00 



to estimate but they contribute a total of | 
next round (which still meets our invariant). 



(3/4)' 



which we will not be able to identify nor 



Si) 



noise energy to the residual for the 



Step 2. Identify heavy hitters with little noise. Next, we show how to identify t. Since 
there are N/k®^ positions hashed by B^\ we need to learn the 0(log(N/k)) bits describing t in 
this context. Previous sublinear algorithms [HE] used a trivial error correcting code, in which the 
t'th column was simply the binary expansion of t in direct sum with a single 1. Thus, if the signal 
consists of xj in the t'th position and zeros elsewhere, we would learn x^ and x^ times the binary 
expansion of t (the latter interpreted as a string of O's and l's as real numbers). These algorithms 
require strict control on the failure probability of each measurement in order to use such a trivial 
encoding. In our case, each measurement succeeds only with probability 0(1) and, generally, fails 
with probability 0(1). So we need to use a more powerful error correcting code and a more reliable 
estimate of \xA. 
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To get a reliable estimate of |xt|, we use the b = 0(loglog iV)-parailel repetition code of all Is. 
That is, we get b independent measurements of |xf| and we decode by taking the median. Let p 
denote the success probability of each individual measurement. Then we expect the fraction p to 
be approximately correct estimates of |xt|, we achieve close to the expectation, and we can arrange 
that p > 1/2. It follows that the median is approximately correct. We use this value to threshold 
the subsequent measurements (i.e., the bits in the encoding) to 0/1 values. 

Now, let us consider these bit estimates. In a single error-correcting code block of b = 0(log log N) 
measurements, we will get close to the expected number, bp, of successful measurements, except 
with probability l/log(A r ), using the Chernoff bound. In the favorable case, we get a number of 
failures less than the (properly chosen) distance of the error-correcting code and we can recover 
the block using standard nearest-neighbor decoding. The number of error-correcting code blocks 
associated with t is 0(log(iV/A;)/loglog N) < O(logiV), so we can take a union bound over all 
blocks and conclude that we recover t with probability 0(1). The invariant requires that the failure 
probability decrease with j. Because the algorithm takes 0{j) parallel independent repetitions, we 
guarantee that the failure probability decreases with j by taking the union over the repetitions. 

We summarize these discussions in the following lemma. We refer to these heavy hitters in the 
list A as the j-large heavy hitters. 

Lemma 6. Identify returns a set A of signal positions that contains at least 3/4 of the heavy 
hitters in T, \T\ < k/2^ , that have magnitude at least | ^(3/4) J ^- v^f* 

We also observe that our analysis is consistent with the bounds we give on the additional mea- 
surement noise V2- The permutation matrix P in <1> is applied before zaj is added and then P 1 is 
applied after by the decoding algorithm. It follows that we can assume vi is permuted at random 
and, therefore, by Markov's inequality, each measurement gets at most an amount of noise energy 

proportional to its fair share of Hz-^H^- Thus, If there are m = 0(/c log N/k) measurements, each 

II II 2 

measurement gets 2 noise energy and identification succeeds anyway provided the lone heavy 

I II 2 

hitter t in that bucket has square magnitude at least ^ > so the at most k smaller heavy hitters, 
that we may miss, together contribute energy fc ^^ 2 = O ( lo g( jy/ 2 ^ 1 • If we recall the definition and 
value of || <&|| 2^2' we see t na, t this error meets our bound. 



Step 2. Estimate heavy hitters. Many of the details in this step are similar to those in Lemma[5] 
(as well as to previous work as the function Estimate is essentially the same as Count Sketch), 
so we give only a brief summary. 

First, we discuss the failure probability of the Estimate procedure. Each estimate is a complete 

failure with probability 1 — 0(1) and the total number of identified positions is O MA; (2/3) J 'Y 

Because we perform j parallel repetitions in estimation, we can easily arrange to lower that failure 

probability, so we assume that the failure probability is at most ^(3/4)-'^, and that we get 

approximately the expected number of (nearly) correct estimates. There are k(2/3) J heavy hitters 
in A, so the expected number of failures is (l/4)(k/2 J ). These, along with the at most l/4(fc/2 J ) 
missed j-large heavy hitters, will form x( J+1 ), the at-most-/c/2 J+1 residual heavy hitters at the next 
iteration. 

In iteration j, Identity returns a list A with k(2/3) J heavy hitter position identified. A group of 
k{2/?>y measurements in yields estimates for the positions in A with aggregate £2 error ±0(1), 
additively. An additional O ^(4/3) J ^ times more measurements, 0(k(8/9)^) in all, improves the 

estimation error to (1/8) (3/4)- ? , additively. These errors, together with the omitted heavy hitters 
that are not j-large and i/w) form the new noise vector at the next iteration, v^ +1 \ 
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Finally, consider the effect of f 2 . We would like to argue that, as in the identification step, the 

II II 2 

noise vector f 2 is permuted at random and each measurement is corrupted by ^ ; where m = 
@(klog(N/k)) is the number of measurements, approximately its fair share of ||z^2|| 2 - Unfortunately, 
the contributions of v<i to the various measurements are not independent as z/2 is permuted, so we 
cannot use such a simple analysis. Nevertheless, they are negatively correlated and we can achieve 
the result we want using [9]. The total li squared error of the corruption over all 0{k) estimates 
is ||f2|| 2 /log(iV//«), which will meet our bound. That is, since ||$|| 2 ^ 2 = 0(log 2 (/c) log(N/k)), the 
U2 contribution to the error is 

( IHI 2 \ = {iog(k) |HI 2 



l^ll2~+2 



as claimed, whence we read off the factor, log(/c) (improvable to log 1 / 2 ^ 1 ) k), which is directly 
comparable to other results that scale $ properly. 

□ 

4.2. Efficiency. 

4.2.1. Number of Measurements. The analysis of isolation and estimation matrices are similar; the 
number of measurements in isolation dominates. 

The number of measurements in iteration j is computed as follows. There are 0{j) parallel 
repetitions in iteration j. They each consist of /c(2/3) J measurements arising out of B^> for 
identification times 0(log(N/k)) measurements for the error correcting code, plus k(2/3) 3 times 
0((4/3) J ) for estimation. This gives 

e ^jk \o g (N/k) + jk = fcio g (jv/fc) Q + o(i) X 3 

Thus we have a sequence bounded by a geometric sequence with ratio less than 1. The sum, over 
all j, is 0(klog(N/k)). 

4.2.2. Encoding and Update Time. The encoding time is bounded by N times the number of non- 



zeros in each column of the measurement matrix. This was analyzed above in Section 4.1 there are 
log 2 (fc)log(7V/fc) non-zeros per column, which is suboptimal by the factor log 2 (fc). By comparison, 
some proposed methods use dense matrices, which are suboptimal by the exponentially-larger factor 
k. This can be improved slightly, as follows. Recall that we used j parallel repetitions in iteration 
j, j < log(/c), to make the failure probability at iteration be; e.g., 2~ 3 , so the sum over j is bounded. 
We could instead use failure probability l/j 2 , so that the sum is still bounded, but the number of 
parallel repetitions will be log(j'), for j < log(fc). This results in log(fc) log log (fc) log(N/k) non-zeros 

per column and z/ 2 contribution to the noise equal to y/\og(k) loglog(fc) ||^|j^ 2 



I2~ 



We can use a pseudorandom number generator such as i \— > [(ai + b mod d) / B\ for random a 
and 6, where B is the number of buckets. Then we can, in time O(l), determine into which bucket 
any i is mapped and determined the i'th element in any bucket. 

Another issue is the time to find and to encode (and to decode) the error-correcting code. Observe 
that the length of the code is 0(loglog N). We can afford time exponential in the length, i.e., time 
log 01 - 1 ) N, for finding and decoding the code. These tasks are straightforward in that much time. 

4.2.3. Decoding Time. As noted above, we can quickly map positions to buckets and find the i'th 
element in any bucket, and we can quickly decode the error-correcting code. The rest of the claimed 
runtime is straightforward. 
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5. Conclusion 

In this paper, we construct an approximate sparse recovery system that is essentially optimal: 
the recovery algorithm is a sublinear algorithm (with near optimal running time), the number of 
measurements meets a lower bound, and the update time, encode time, and column sparsity are 
each within log factors of the lower bounds. We conjecture that with a few modifications to the 
distribution on measurement matrices, we can extend this result to the l\ < Ci\ error metric 
guarantee. We do not, however, think that this approach can be extended to the "for all" signal 
model (all current sublinear algorithms use at least one factor O(logiV) additional measurements) 
and leave open the problem of designing a sublinear time recovery algorithm and a measurement 
matrix with an optimal number of rows for this setting. 
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6. Appendix 

We have a full description of the matrix algebra defined in Table |2j 

• Row direct sum. The row direct sum A® r B is a matrix with N columns that is the 
vertical concatenation of A and B. 

• Element- wise product. If A and B are both r x N matrices, then AQB is also anrxJV 
matrix whose (i, j) entry is given by the product of the entries in A and B. 

• Semi-direct product. Suppose A is a matrix of r% rows (and iV columns) in which each 
row has exactly h non-zeros and B is a matrix of r2 rows and h columns. Then Bt< T A is 
the matrix with r\T2 rows, in which each non-zero entry a of A is replaced by a times the 
j'th column of B, where a is the j'th non-zero in its row. This matrix construction has 
the following interpretation. Consider (Sx r A)x where A consists of a single row, p, with 
h non-zeros and x is a vector of length N. Let y = p0x be the element-wise product of p 
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and x. If p is 0/1-valued, y picks out a subset of x. We then remove all the positions in 
y corresponding to zeros in p, leaving a vector y' of length h. Finally, (Bx r A)x is simply 
the matrix-vector product By', which, in turn, can be interpreted as selecting subsets of 
y, and summing them up. Note that we can modify this definition when A has fewer than 
h non-zeros per row in a straightforward fashion. 
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Recover($,j/) 
Output: x = approximate representation of x 

y = P l y 

o(°) = 

For j = to O(logfc) { 
y = y- P _1 *a^) 
split = u)W® r zW 
A= Identify(Z)(- ? ), m/- 7 ')) 
b (j) = estimate^'), A) 

a (J+l) = a (i) + b (J) 

} 



Identify(D^) , tw^')) 
Output: A= list of positions 

A = 

Divide into sections [v,u] of size 0(log(c J (N/k))) 

For each section { 

u = median(|i>£|) 

For each £ // threshold measurements 

u t = Q{u t -u/2) // @(u) = 1 if u>0, 8(it) = otherwise 

Divide w into blocks hi of size O(loglogAT) 
For each b% 

fii = Decode(6j) // using error-correcting code 

A =Integer(/?i, (3 2 , • • •) // integer rep'ed by bits /?i,/3 2 ,... 

A = AU{A} 

} 



Estimate(^'),z^),A) 
Output: b= vector of positions and values 
5 = 

For each A G A 

b A =median £ S . t-B (i) =1 (^ 0) sg) 

Figure 3. Pseudocode for the overall decoding algorithm. 



